13 | Large Language Models

Max Pellert (https://mpellert.at)

Deep Learning for the Social Sciences

Main types of transformers

Transformers can be grouped into three categories according to the form of the input and output data

In a problem such as sentiment analysis, we take a sequence of words as input and provide a single variable representing the sentiment of the text, for example happy or sad, as output

Here a transformer is acting as an encoder of the sequence

Main types of transformers

Other problems might take a single vector as input and generate a word sequence as output, for example if we wish to generate a text caption given an input image

In such cases the transformer functions as a decoder, generating a sequence as output

Finally, in sequence-to-sequence processing tasks, both the input and the output comprise a sequence of words, for example if our goal is to translate from one language to another

In this case, transformers are used in both encoder and decoder roles

Decoder

One well-known model series by OpenAI:

GPT … Generative Pretrained Transformer

With decoders, the goal is to use the transformer architecture to construct an autoregressive model with the conditional distributions \(p(x_n |x_1 , \ldots , x_{n−1} )\) being learned from data

Decoder

The model takes as input \(n − 1\) tokens and its corresponding output represents the conditional distribution for token \(n\)

If we draw a sample from this distribution then we have extended the sequence to \(n\) tokens and this new sequence can be fed back through the model to give a distribution over token \(n + 1\) and so on
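This feedback loop can be sketched in a few lines of numpy; the `toy_model` standing in for the transformer, and all names and dimensions, are illustrative assumptions, not part of any real model:

```python
import numpy as np

def generate(next_token_probs, start_ids, n_new, seed=0):
    """Autoregressive generation: sample token n from the model's
    conditional distribution, append it, feed the extended sequence
    back in, and repeat."""
    rng = np.random.default_rng(seed)
    seq = list(start_ids)
    for _ in range(n_new):
        probs = next_token_probs(seq)   # p(x_n | x_1, ..., x_{n-1})
        seq.append(int(rng.choice(len(probs), p=probs)))
    return seq

# toy stand-in for the model: prefers to repeat the last token
def toy_model(seq):
    p = np.full(4, 0.1)
    p[seq[-1]] = 0.7
    return p / p.sum()

out = generate(toy_model, [2], 5)
assert len(out) == 6 and out[0] == 2
```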

The architecture of a GPT model consists of a stack of transformer layers that take a sequence \(x_1 , \ldots , x_N\) of tokens, each of dimensionality \(D\), as input and produce a sequence \(\tilde{x}_{1} , \ldots , \tilde{x}_{N}\) of tokens, again of dimensionality \(D\), as output

We offset the input sequence by one by adding a special <start> token

Decoder

Each output needs to represent a probability distribution over the dictionary of tokens at that time step, and this dictionary has dimensionality \(K\) whereas the tokens have a dimensionality of \(D\)

We therefore make a linear transformation of each output token using a matrix \(W^{(p)}\) of dimensionality \(D × K\) followed by softmax:

\[Y = \mathbf{Softmax}[\tilde{X}W^{(p)}]\]

The result can be interpreted as probability distributions over the entries of the vocabulary
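A minimal numpy sketch of this projection (the dimensions and random matrices are made up for illustration):

```python
import numpy as np

def softmax(a, axis=-1):
    # subtract the row max for numerical stability
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 5, 16, 100                 # sequence length, token dim, vocabulary size
X_tilde = rng.normal(size=(N, D))    # transformer output tokens
W_p = rng.normal(size=(D, K))        # D x K projection matrix

Y = softmax(X_tilde @ W_p)           # one distribution per sequence position
assert Y.shape == (N, K)
assert np.allclose(Y.sum(axis=1), 1.0)   # each row is a valid distribution
```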

Causal (or masked) attention

We have to ensure that the network is not able to “cheat” by looking ahead in the sequence, i.e. a token is not allowed to attend to tokens that follow it

We set to zero all of the attention weights that correspond to a token attending to any later token in the sequence

This is achieved by setting the corresponding pre-activation values to −∞ so that the softmax evaluates to zero for those outputs and also takes care of the normalization across the non-zero outputs

Without that, the model would be unable to generate new sequences since the subsequent token by definition is not available at test time

Attention weights corresponding to the red elements are set to zero, e.g. for “across”, the output can depend only on the input tokens “<start>”, “I” and “swam”
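The masking step can be sketched directly in numpy (the scores here are random placeholders for the real pre-activation attention values):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

N = 4
rng = np.random.default_rng(1)
scores = rng.normal(size=(N, N))   # pre-activation attention scores

# position n may only attend to positions 1..n:
# set the strictly upper-triangular entries to -inf
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
scores[mask] = -np.inf             # softmax maps these to exactly zero

weights = softmax(scores, axis=-1)
assert np.allclose(np.triu(weights, k=1), 0.0)   # no attention to later tokens
assert np.allclose(weights.sum(axis=-1), 1.0)    # rows remain normalized
```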

Decoding Strategies

The output of a decoder transformer is a probability distribution over values for the next token in the sequence, from which one particular token must be chosen to extend the sequence

There are several options for how to choose

One obvious approach, called greedy search, is simply to select the token with the highest probability

That makes the model deterministic, in that a given input sequence always generates the same output sequence

Note that simply choosing the highest probability token at each stage is not the same as selecting the highest probability sequence of tokens
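Greedy search itself is a one-liner per step; here is a hedged sketch in which `toy_model` is a made-up stand-in for the trained transformer:

```python
import numpy as np

def greedy_decode(next_token_probs, start_ids, max_len):
    """Repeatedly pick the single most probable next token."""
    seq = list(start_ids)
    for _ in range(max_len):
        probs = next_token_probs(seq)
        seq.append(int(np.argmax(probs)))   # deterministic choice
    return seq

# toy stand-in for the model: strongly prefers (last token + 1) mod 3
def toy_model(seq):
    p = np.full(3, 0.1)
    p[(seq[-1] + 1) % 3] = 0.8
    return p / p.sum()

print(greedy_decode(toy_model, [0], 4))   # → [0, 1, 2, 0, 1]
```

Because argmax is deterministic, running this twice with the same start sequence always gives the same output.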

Decoding Strategies

One technique that has the potential to generate higher probability sequences than greedy search is called beam search

Instead of choosing the single most probable token value at each step, we maintain a set of \(B\) hypotheses, where \(B\) is called the beam width, each consisting of a sequence of token values up to step \(n\)

We then feed all these sequences through the network, and for each sequence we find the \(B\) most probable token values, thereby creating \(B^2\) possible hypotheses for the extended sequence

This list is then pruned by selecting the most probable \(B\) hypotheses according to the total probability of the extended sequence

Decoding Strategies

Thus, the beam search algorithm maintains \(B\) alternative sequences and keeps track of their probabilities, finally selecting the most probable sequence amongst those considered

The probability of a sequence is obtained by multiplying the probabilities at each step of the sequence: since these probabilities are always less than or equal to one, a long sequence will generally have a lower probability than a short one, biasing the results towards short sequences

For this reason the sequence probabilities are generally normalized by the corresponding lengths of the sequence before making comparisons
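A minimal sketch of the algorithm, working in log-probabilities to avoid numerical underflow when multiplying many step probabilities (the toy model and all names are illustrative assumptions):

```python
import numpy as np

def beam_search(next_token_logprobs, start_ids, max_len, B):
    """Maintain the B most probable partial sequences (the beam)."""
    beams = [(0.0, list(start_ids))]            # (total log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            lp = next_token_logprobs(seq)
            for t in np.argsort(lp)[-B:]:       # B best extensions per beam
                candidates.append((logp + lp[t], seq + [int(t)]))
        # prune the up-to-B^2 candidates back down to B hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:B]
    # length-normalize the log-probability before the final comparison
    return max(beams, key=lambda c: c[0] / len(c[1]))[1]

# toy stand-in for the model: strongly prefers (last token + 1) mod 3
def toy_logprobs(seq):
    p = np.full(3, 0.1)
    p[(seq[-1] + 1) % 3] = 0.8
    return np.log(p / p.sum())

assert beam_search(toy_logprobs, [0], 4, 2) == [0, 1, 2, 0, 1]
```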

Decoding Strategies

A problem with greedy search and beam search is that they limit the diversity of potential outputs and the generation process can even become stuck in a loop, where the same sub-sequence of words is repeated over and over again

We can also generate successive tokens simply by sampling from the softmax distribution at each step

However, this can lead to sequences that are nonsensical. The problem arises from the typically very large size of the token dictionary, which has a long tail of many tokens that each have a very small probability but in aggregate account for a significant fraction of the total probability mass

Decoding Strategies

This leads to a significant chance that the system will make a bad choice for the next token

Alternatively, we can consider only the states having the top \(K\) probabilities, for some choice of \(K\), and then sample from these according to their renormalized probabilities

A variant of this approach, called top-p sampling or nucleus sampling, calculates the cumulative probability of the top outputs until a threshold is reached and then samples from this restricted set of token states
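Both filters can be sketched as operations on a probability vector; the example distribution below is made up for illustration:

```python
import numpy as np

def top_k_filter(probs, K):
    # keep only the K most probable tokens, then renormalize
    keep = np.argsort(probs)[-K:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    # nucleus sampling: keep the smallest set of tokens whose
    # cumulative probability reaches the threshold p
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # number of tokens to keep
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
assert np.count_nonzero(top_k_filter(probs, 2)) == 2
assert np.count_nonzero(top_p_filter(probs, 0.75)) == 2   # 0.5 + 0.3 reaches 0.75
```

Sampling then proceeds from the filtered, renormalized distribution instead of the full softmax output.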

Decoding Strategies

A “softer” version of top-K sampling is to introduce a parameter \(T\) called temperature into the definition of the softmax function:

\[ y_i = \frac{\mathrm{exp}(a_i/T)}{\sum_{j} \mathrm{exp}(a_j/T)} \]

Then we can sample the next token from this modified distribution

Decoding Strategies

\[ y_i = \frac{\mathrm{exp}(a_i/T)}{\sum_{j} \mathrm{exp}(a_j/T)} \]

When \(T → 0\), the probability mass is concentrated on the most probable state, with all other states having zero probability, and hence this comes very close to greedy selection

For \(T = 1\), we recover the unmodified softmax distribution

As T → ∞, the distribution becomes uniform across all states

By choosing a value in the range \(0 < T < 1\) the probability is concentrated towards the higher values
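The limiting behaviour is easy to verify numerically; the logits below are arbitrary illustrative values:

```python
import numpy as np

def softmax_T(a, T):
    # temperature-scaled softmax: divide the pre-activations by T
    a = np.asarray(a, dtype=float) / T
    a -= a.max()                     # numerical stability
    e = np.exp(a)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
cold = softmax_T(logits, 0.05)   # T -> 0: close to greedy selection
warm = softmax_T(logits, 1.0)    # T = 1: the unmodified softmax
hot = softmax_T(logits, 100.0)   # T -> inf: close to uniform

assert cold[0] > 0.99
assert warm[0] > warm[1] > warm[2]
assert np.allclose(hot, 1/3, atol=0.01)
```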

Encoder

BERT … Bidirectional Encoder Representations from Transformers

A well-known model that triggered many variants

RoBERTa, DeBERTa, ALBERT, DistilBERT, ELECTRA …

Encoder

Models that take sequences as input and produce fixed-length vectors, such as class labels, as output.

The first token of every input string is given by a special token <class>, and the corresponding output of the model is ignored during pre-training

Its role will become apparent when we discuss fine-tuning

The model is pre-trained by presenting token sequences as the input

Encoder

A randomly chosen subset of the tokens, say 15%, is replaced with a special token denoted <mask>

The model is trained to predict the missing tokens at the corresponding output nodes (similar to the masking used in word2vec)

I <mask> across the river to get to the <mask> bank.

The network should predict “swam” at output location 2 and “other” at output location 10

The term “bidirectional” means that the network sees words both before and after the masked word and can use both sources of information to make a prediction (no masking of the attention matrix is needed)
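The corruption step of this pre-training objective can be sketched as follows; the function name, mask rate, and seed are illustrative assumptions, and real BERT tokenization works on subword units rather than whole words:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Replace a random subset of tokens with <mask>; return the
    corrupted sequence and the positions the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("<mask>")
            targets[i] = tok          # the training label at this position
        else:
            masked.append(tok)
    return masked, targets

sent = "I swam across the river to get to the other bank .".split()
masked, targets = mask_tokens(sent)
assert len(targets) > 0                                  # something got masked
assert all(masked[i] == "<mask>" for i in targets)       # masks in place
assert all(sent[i] == t for i, t in targets.items())     # labels preserved
```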

Encoder

Compared to the decoder model, an encoder is less efficient since only a fraction of the sequence tokens are used as training labels

Moreover, an encoder model is unable to generate sequences

Once the encoder model is trained it can then be finetuned for a variety of different tasks

To do this a new output layer is constructed whose form is specific to the task being solved

For a typical text classification task, only the first output position is used, which corresponds to the <class> token

Encoder

If this output has dimension \(D\), then it is transformed by a matrix of parameters of dimension \(D × K\), where \(K\) is the number of classes

This is followed by softmax, to give probabilities for each of the classes

If the goal is instead to classify each token of the input string, for example to assign each token to a category (such as person, place, color, …) then the first output is ignored and the subsequent outputs have a shared linear-plus-softmax layer

During fine-tuning all model parameters including the new output matrix are learned by stochastic gradient descent
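The classification head on top of the <class> output can be sketched in a few lines; the random encoder outputs stand in for a trained model, and the dimensions are illustrative:

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
N, D, K = 10, 16, 3                # sequence length, token dim, number of classes
outputs = rng.normal(size=(N, D))  # encoder outputs, one row per token

W_cls = rng.normal(size=(D, K))    # new task-specific D x K output matrix
# only the first output position (the <class> token) is used
class_probs = softmax(outputs[0] @ W_cls)
assert class_probs.shape == (K,)
assert np.isclose(class_probs.sum(), 1.0)
```

For per-token classification, the same linear-plus-softmax layer would instead be applied to `outputs[1:]`, shared across positions.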

Encoder-Decoder

Let’s just discuss these briefly for completeness’ sake

Sequence-to-sequence models as are used for translating between languages (e.g. English - French) are often of this type

We can use a decoder model to generate the token sequence corresponding to the French output, token by token, as discussed previously

The main difference is that this output needs to be conditioned on the entire input sequence corresponding to the English sentence

An encoder transformer can be used to map the input token sequence into a suitable internal representation, which we denote by \(Z\)

Encoder-Decoder

To incorporate \(Z\) into the generative process for the output sequence, we use a modified form of the attention mechanism called cross attention

This is the same as self-attention except that, although the query vectors come from the sequence being generated, in this case the French output sequence, the key and value vectors come from the sequence represented by \(Z\)

Returning to our analogy with a video streaming service: The user would be sending their query vector to a different streaming company who then compares it with their own set of key vectors to find the best match and then returns the associated value vector in the form of a movie
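A single-head sketch of cross attention, with queries from the decoder states and keys and values from the encoder representation \(Z\) (all matrices are random placeholders for learned parameters):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, Z, Wq, Wk, Wv):
    """Queries come from the sequence being generated; keys and
    values come from the encoder representation Z."""
    Q = decoder_states @ Wq
    K, V = Z @ Wk, Z @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product attention
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
D, M, N = 8, 5, 3                  # model dim, encoder length, decoder length
Z = rng.normal(size=(M, D))        # encoder representation of the input
dec = rng.normal(size=(N, D))      # decoder states for the output so far
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

out = cross_attention(dec, Z, Wq, Wk, Wv)
assert out.shape == (N, D)         # one attended vector per decoder position
```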

Large Language Models

LLMs are typically based on transformer architectures

They feature billions of parameters

And they are trained on large corpora of text

In the case of chatbots (or more generally, when direct interaction with humans is involved) additional training is usually added (“Reinforcement Learning from Human Feedback”)

Let’s go through these parts and how they are combined to create a model like ChatGPT

Parameters

BERT\(_{\mathrm{BASE}}\) had 12 encoder layers with 12 bidirectional self-attention heads, totaling 110 million parameters

BERT\(_{\mathrm{LARGE}}\) had 24 encoder layers with 16 bidirectional self-attention heads, totaling 340 million parameters

This was considered very large at the time (just a few years ago)!

Today’s LLMs range from a few billion to hundreds of billions of parameters (some over a trillion)

There is an ongoing trend not to even disclose that number publicly

Scaling Laws

Training Data

Companies build constantly growing proprietary datasets (many terabytes of text)

Many practices (often involving private data) are still in a legal grayzone, with many lawsuits worldwide establishing boundaries

There are also a number of public domain datasets

Data quality matters a lot

This is again where your thinking from the social sciences can come in very handy!

Consider for example the effect of standard filtering of training data

Data filtering

“[…] only 16 clusters of excluded documents that are largely sexual in nature (31% of the excluded documents)”

Books3

Chatbots should also know how to follow instructions and cater specifically to their users

InstructGPT

InstructGPT

InstructGPT is basically the forerunner of ChatGPT

The interaction of users with GPT-3 showed that the model often wasn’t good at following instructions that were given to it

Predicting the next token on a webpage from the internet is different from the objective “follow the user’s instructions helpfully and safely” → Misalignment

OpenAI uses a procedure to align the behavior of GPT-3 to the stated preferences of a specific group of people (mostly labelers and researchers)

This procedure lets the model learn from human feedback

Reinforcement Learning from
Human Feedback

This technique uses human preferences as a reward signal to fine-tune a model such as GPT-3

First, human-written prompt completions are used to finetune “base” GPT-3 (this serves as a first orientation of the model towards following instructions), creating the initial model

Next, researchers collect a dataset of human-labeled comparisons between different outputs from the model on a larger set of prompts (relative comparisons have a number of methodological advantages)

Then, they train a separate and, in this case, smaller reward model (RM) on this dataset to predict which model output the human labelers would prefer

Finally, this RM is used as a reward function to finetune the initial model to maximize the reward

For more details on the general procedures, see for example https://huggingface.co/blog/rlhf
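The heart of the reward-model training step is a pairwise preference loss; a minimal sketch of that objective (the exact form used in practice varies, and the scalar rewards here are made-up inputs):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise loss: push the reward of the preferred output
    above the reward of the rejected one (negative log-sigmoid
    of the reward difference)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# the loss shrinks as the preferred output's reward pulls ahead
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
# with equal rewards, the loss is log 2 (the model is indifferent)
assert np.isclose(preference_loss(0.0, 0.0), np.log(2))
```

Minimizing this loss over the dataset of human comparisons trains the RM to assign higher scalar rewards to outputs the labelers preferred.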

Prompt examples

Prompt examples

These examples come from prompts submitted by users to the OpenAI Playground (where a pop-up message informed them about possible inclusion in future training data), from use cases users stated in their application to the waiting list, or were created by the labelers

The labelers’ demonstrations of outputs are then used for finetuning

ChatGPT

Further refinement of the RLHF procedures led to the success of ChatGPT: a “dialogue dataset” was mixed with the InstructGPT data for training

Not only do users prefer the outputs of these models; the researchers also report substantial reductions in harmful and untruthful outputs and no drastic performance decreases on many popular NLP benchmarks

In principle, RLHF is neutral and always raises the question: “Aligned to whom?”

Basically, now we have all the parts together that are needed to build ChatGPT…

For more on RLHF

For the deep dive into LLMs